Hardware Architecture 2
♻ ☆ LoopTree: Exploring the Fused-layer Dataflow Accelerator Design Space
Latency and energy consumption are key metrics in the performance of deep
neural network (DNN) accelerators. A significant factor contributing to latency
and energy is data transfers. One method to reduce data transfers is to reuse
data when multiple operations use the same data. Fused-layer accelerators reuse
data across operations in different layers by retaining intermediate data in
on-chip buffers, which has been shown to reduce energy consumption and latency.
Moreover, the intermediate data is often tiled (i.e., broken into chunks) to
reduce the on-chip buffer capacity required to reuse the data. Because on-chip
buffer capacity is frequently more limited than computation units, fused-layer
dataflow accelerators may also recompute certain parts of the intermediate data
instead of retaining them in a buffer. Achieving efficient trade-offs between
on-chip buffer capacity, off-chip transfers, and recomputation requires
systematic exploration of the fused-layer dataflow design space. However, prior
work only explored a subset of the design space, and more efficient designs are
left unexplored.
In this work, we propose (1) a more extensive design space that has more
choices in terms of tiling, data retention, and recomputation and, importantly,
allows us to explore them in combination, (2) a taxonomy to systematically
specify designs, and (3) a model, LoopTree, to evaluate the latency, energy
consumption, buffer capacity requirements, and off-chip transfers of designs in
this design space. We validate our model against a representative set of prior
architectures, achieving a worst-case 4% error. Finally, we present case
studies that show how exploring this larger space results in more efficient
designs (e.g., up to a 10$\times$ buffer capacity reduction to achieve the same
off-chip transfers).
comment: To be published in IEEE Transactions on Circuits and Systems for
Artificial Intelligence
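The trade-off the abstract describes — retaining tiled intermediate data on chip versus recomputing overlapping parts of it — can be illustrated with a toy example. The sketch below uses simple 1-D valid convolutions as stand-in layers; the names (`conv1d`, `fused_tiled`, the `tile` parameter) are illustrative assumptions, not the paper's actual model or API.

```python
import numpy as np

def conv1d(x, w):
    # "valid" 1-D convolution: output length is len(x) - len(w) + 1
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def unfused(x, w1, w2):
    # Unfused baseline: layer 1 runs over the whole input, and the entire
    # intermediate result is buffered before layer 2 consumes it.
    mid = conv1d(x, w1)
    return conv1d(mid, w2)

def fused_tiled(x, w1, w2, tile=4):
    # Fused with tiling: produce the final output tile by tile, keeping only
    # the slice of the intermediate that the current output tile needs.
    # Elements of the intermediate that overlap between adjacent tiles (the
    # "halo") are recomputed each time, trading extra computation for a
    # much smaller intermediate buffer.
    k1, k2 = len(w1), len(w2)
    out_len = len(x) - k1 - k2 + 2
    out = np.empty(out_len)
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # out[start:stop] needs mid[start : stop + k2 - 1], which in turn
        # needs x[start : stop + k2 - 1 + k1 - 1]
        mid_tile = conv1d(x[start:stop + k2 - 1 + k1 - 1], w1)
        out[start:stop] = conv1d(mid_tile, w2)
    return out
```

Here the intermediate buffer shrinks from the full layer-1 output to roughly `tile + k2 - 1` elements, at the cost of recomputing `k2 - 1` halo elements per tile — the kind of capacity/recomputation trade-off LoopTree is built to evaluate.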
♻ ☆ Research Directions for Verifiable Crypto-Physically Secure TEEs
A niche corner of the Web3 world is increasingly making use of hardware-based
Trusted Execution Environments (TEEs) to build decentralized infrastructure.
One of the motivations to use TEEs is to go beyond the current performance
limitations of cryptography-based alternatives such as zero-knowledge proofs
(ZKP), fully homomorphic encryption (FHE), and multi-party computation (MPC).
Despite their appealing advantages, current TEEs suffer from serious
limitations as they are not secure against physical attacks, and their
attestation mechanism is rooted in trust in the chip manufacturer. As a result,
Web3 applications have to rely on cloud infrastructure to act as trusted
guardians of hardware-based TEEs and must accept trusting chip
manufacturers. This work explores how we could potentially architect
and implement chips that would be secure against physical attacks and would not
require putting trust in chip manufacturers. One goal of this work is to
motivate the Web3 movement to acknowledge and leverage the substantial amount
of relevant hardware research that already exists. In brief, a combination of:
(1) physical unclonable functions (PUFs) to secure the root-of-trust; (2)
masking and redundancy techniques to secure computations; (3) open source
hardware and imaging techniques to verify that a chip matches its expected
design; can help move towards attesting that a given TEE can be trusted without
the need to trust a cloud provider and a chip manufacturer.
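Of the three ingredients listed, masking is the easiest to illustrate in isolation: a secret is split into random shares so that observing any single share (e.g. via a side channel) reveals nothing. The minimal sketch below shows first-order Boolean masking of a byte; the function names are illustrative, not from the abstract.

```python
import secrets

def mask(secret: int) -> tuple[int, int]:
    # First-order Boolean masking: split an 8-bit secret into two shares
    # whose XOR equals the secret. Each share alone is uniformly random,
    # so leaking one share leaks nothing about the secret.
    r = secrets.randbits(8)
    return r, secret ^ r

def unmask(share0: int, share1: int) -> int:
    # Recombine the shares to recover the secret.
    return share0 ^ share1
```

Real masked implementations operate share-wise through entire cipher computations (never recombining intermediates), and higher orders use more shares; this sketch only conveys the core idea.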